Analysis of the geometric properties of a mean-field HP model on a square lattice for protein
structure shows that structures with a large number of switchbacks between surface and core sites
are chosen favorably by peptides as unique ground states. Global comparison of model (binary)
peptide sequences with concatenated (binary) protein sequences listed in the Protein Data Bank
and the Dali Domain Dictionary indicates that the highest correlation occurs between model peptides
choosing the favored structures and those portions of protein sequences containing alpha-helices.



PACS numbers: 87.10.+e, 87.15.-v, 87.15.By

Determining the three-dimensional structure of proteins is a complex physical and mathematical
problem of prime importance in molecular biology, medicine and pharmacology. It is believed that
the folding instruction of a protein is encoded in its amino acid sequence, and much has been
learned from model studies about protein structure and folding kinetics. Yet much remains to be
understood. This simple fact is already intriguing: the number of possible globular structures for
a peptide of typical length, about 300 amino acids, is practically infinite; the number of proteins
whose structures are known empirically or hypothetically is more than a hundred thousand and is
growing rapidly; yet the number of classes of native protein structures is about five hundred and
is believed unlikely to exceed a thousand in the long run. Numerical simulations based on lattice
models have shown that structures of exceptionally high designability - those that attract a large
number of protein sequences to conform to them - do exist. Why such structures would emerge is,
however, not well understood. Protein folding also has an outstanding temporal feature: the initial
collapse to a globular shape and the formation of α-helices are completed in less than $10^{-7}$
seconds, while the rest of the folding takes up to ten seconds to complete.

In this report, based on results from a mean-field lattice model, we observe that structures with
high designability are rich in a type of substructure that suggests α-helices in real proteins, and
we explain the reason for this phenomenon. This notion is supported by global comparisons of model
structural sequences with (binary) sequences constructed from sets of proteins of known structure:
the Protein Data Bank (PDB) and the Dali Domain Dictionary (DDD). Since the mean field in the model
represents the hydrophobic potential that is known to cause the initial collapse of a peptide to a
globular shape, the results may explain why the initial collapse and the formation of α-helices
occur essentially simultaneously and rapidly, and are temporally separated from other, slower
folding processes that are driven by far-neighbor inter-residue interactions.

In the minimal model for protein folding, the HP model of Dill et al., the 20 kinds of amino acids
are divided into two types, hydrophobic and polar. This reduces a peptide chain of length $N$ to a
binary "peptide" $p = (p_1, p_2, \ldots, p_N)$, where $p_i = 0$ (1) if the amino acid at the $i$th
position on the chain is polar (hydrophobic). The structure of a protein is represented by a
self-avoiding path compactly embedded on a lattice $\mathcal{L}$, and the energy associated with a
peptide conforming to a particular structure is computed from the contact energies between
nearest-neighbor residues that are not adjacent along the peptide. A set of well-tested contact
energies derived from proteins of known structure is the Miyazawa-Jernigan matrix, which is,
however, well approximated by an effective mean-field potential expressing the hydrophobicities of
the residues. In the binary form of this approximation the Hamiltonian of the HP model is reduced
to that of a mean-field model:

$$H(p, s) = -p \cdot s = \frac{1}{2}\left(|s - p|^2 - p^2 - s^2\right) \qquad (1)$$

where $s = (s_1, s_2, \ldots, s_N)$ is a binary "structure" converted from a self-avoiding path
with the assignment $s_i = 1$ (0) if the $i$th site is a core (surface) site on the lattice [12].
Empirical observation suggests that protein folding proceeds in two steps: a first stage of fast
collapse and formation of alpha-helices (and probably some not properly folded beta-sheets),
presumably caused mainly by hydrophobic interactions under polymeric constraints, followed by a
second stage of slow annealing, caused by far-neighbor inter-residue interactions, that gives the
final native state [10, 13]. Since Eq. (1) is a local, mean-field approximation that leaves out
residual (i.e., left over from mean-field averaging) far-neighbor interactions, it can be relied on
to account for only the first stage.
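
As a numerical illustration of Eq. (1), the following minimal Python sketch checks, for short binary vectors, that the mean-field energy $-p \cdot s$ agrees with the Hamming-distance form on the right-hand side. The function names (`energy`, `hamming_sq`, `energy_from_distance`) are illustrative choices and are not taken from the original work.

```python
from itertools import product


def energy(p, s):
    """Mean-field energy H(p, s) = -p . s for binary tuples p and s."""
    return -sum(pi * si for pi, si in zip(p, s))


def hamming_sq(p, s):
    """|s - p|^2, which equals the Hamming distance for binary vectors."""
    return sum((si - pi) ** 2 for pi, si in zip(p, s))


def energy_from_distance(p, s):
    """Right-hand side of Eq. (1): (|s - p|^2 - p^2 - s^2) / 2."""
    p2 = sum(pi * pi for pi in p)
    s2 = sum(si * si for si in s)
    return (hamming_sq(p, s) - p2 - s2) / 2


# Exhaustive check of the identity for all pairs of binary vectors of length 4.
identity_holds = all(
    energy(p, s) == energy_from_distance(p, s)
    for p in product((0, 1), repeat=4)
    for s in product((0, 1), repeat=4)
)
```

Because $p^2$ and $s^2$ simply count the 1's in each vector, the energy is lowest when the hydrophobic residues of $p$ coincide with the core sites of $s$.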

We denote by $\mathcal{S}$ the set of all distinct structures $s$ on $\mathcal{L}$ and by
$\mathcal{V}$ the set of all possible peptides $p$ of length $N$. For each $p$ the selection of the
$s$ giving the minimum $H$ defines a mapping from $\mathcal{V}$ to $\mathcal{S}$. There are $p$'s
that are mapped to more than one $s$, or are mapped to $s$'s that correspond to more than one
self-avoiding path. Such $p$'s are removed from $\mathcal{V}$ and their target $s$'s are removed
from the competition for high designability, because a peptide that does not conform to a unique
structure at all times is not expected to survive the evolutionary selection process [14]. It has
also been shown that not admitting degenerate states in a coarse-grained model is similar to
removing peptides that have low foldability in a finer-grained model. (Many states that are
degenerate in the present coarse-grained model would, in a model with higher energy resolution, be
states of different symmetries with nearly degenerate energies. The energy landscape for the ground
state among these would likely contain deep local minima, and a peptide choosing such a ground
state would likely be a poor folder.) The mapping then partitions what remains in $\mathcal{V}$
into classes, with all the $p$'s in each class mapped to a single $s$, whose designability is
simply the number of $p$'s in the class. For this mapping only the first term on the right-hand
side of Eq. (1) matters: $p^2$ is fixed for a given peptide, and $s^2$, the number of core sites,
is the same for every compact structure on $\mathcal{L}$. Minimizing $H$ is therefore equivalent to
minimizing $|s - p|^2$, the Hamming distance between the two points $p$ and $s$ in an
$N$-dimensional unit hypercube. The designability of an $s$ is then equal to the number of peptides
in the Voronoi polytope around it in this hypercube [12].
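
The mapping from peptides to structures can be sketched as a brute-force nearest-neighbor search in the hypercube. The example below is only a toy under stated assumptions: the structure set `toy` is hypothetical (it is not derived from lattice paths), path degeneracy is ignored, and peptides with tied nearest structures are simply discarded without also removing their target structures.

```python
from itertools import product


def hamming(p, s):
    """Hamming distance between two equal-length binary tuples."""
    return sum(pi != si for pi, si in zip(p, s))


def designabilities(structures):
    """Count, for each structure, the peptides that have it as unique nearest structure.

    `structures` is a list of equal-length binary tuples with the same number
    of 1's, so that minimizing H is the same as minimizing Hamming distance.
    """
    n = len(structures[0])
    counts = {s: 0 for s in structures}
    for p in product((0, 1), repeat=n):
        ranked = sorted((hamming(p, s), s) for s in structures)
        unique = len(ranked) == 1 or ranked[0][0] < ranked[1][0]
        if unique:  # degenerate peptides (ties) are dropped
            counts[ranked[0][1]] += 1
    return counts


# Hypothetical toy structure set of length 4, each with two core sites.
toy = [(1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0)]
result = designabilities(toy)
```

In this toy set the structure `(0, 1, 1, 0)` lies between the other two and ends up with the smallest count, illustrating how crowding in the hypercube lowers designability.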

Excepting those removed for degeneracy, $\mathcal{V}$ is just the set of all the vertices of a unit
hypercube. In comparison, owing to the constraints of compactness and self-avoidance imposed on
paths on $\mathcal{L}$, points in $\mathcal{S}$ are sparsely distributed in the hypercube, so that
$\mathcal{S} \subset \mathcal{V}$. For example, on a $6 \times 6$ square lattice, the number of
elements in $\mathcal{V}$ (including those to be removed for degeneracies) is
$2^{36} = 68719476736$, while that in $\mathcal{S}$ is 30408 (but only 18213 of them have no path
degeneracy). If the points in $\mathcal{S}$ were uniformly distributed in the hypercube, then the
Voronoi polytope around each $s$ would be the same and every $s$ would have the same designability.
But owing to boundary effects and geometric constraints imposed on the compact paths on
$\mathcal{L}$, the distribution of $s$'s in $\mathcal{S}$ cannot be uniform; those $s$'s residing
in regions of the hypercube that are of especially low density (in $s$'s) will then have especially
high designability.
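
To make the construction of the structure set concrete, the sketch below enumerates compact self-avoiding paths on a small $L \times L$ square lattice by depth-first search and converts each to a binary core/surface string. It is a minimal illustration, not the procedure used for the numbers quoted above: paths are counted as directed, no lattice symmetries are factored out, and the function names are hypothetical, so the resulting counts need not follow the same convention as the text.

```python
def compact_paths(L):
    """Yield every directed self-avoiding path visiting all sites of an L x L lattice."""
    n_sites = L * L

    def extend(path, visited):
        if len(path) == n_sites:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < L and 0 <= nxt[1] < L and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for x in range(L):
        for y in range(L):
            yield from extend([(x, y)], {(x, y)})


def to_structure(path, L):
    """Binary structure: 1 for a core (interior) site, 0 for a surface (boundary) site."""
    return tuple(int(0 < x < L - 1 and 0 < y < L - 1) for x, y in path)


def distinct_structures(L):
    """Set of distinct binary structures generated by all compact paths."""
    return {to_structure(path, L) for path in compact_paths(L)}
```

Many different paths collapse onto the same binary string, which is the path degeneracy referred to above; on a $3 \times 3$ lattice, for instance, the single core site can only sit at an odd position along the path. The search cost grows very quickly with $L$, so this sketch is meant for small lattices only.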

We now examine how geometric constraints cause the emergence of $s$'s with especially high
designabilities by first attempting to replace the constraints by a set of explicit algebraic
"rules". Consider a structure in $\mathcal{S}$ to be a chain of 0's and 1's linked by $N - 1$ links
of three types, 0-0, 1-0 (or 0-1), and 1-1, with $n_{00}$, $n_{10}$ and $n_{11}$ being the numbers
of such links, respectively. The structure is partitioned by the 1-0 links into $n_{10} + 1$
"islands" of contiguous 1's or 0's. (Peptides in $\mathcal{V}$ may be similarly described, but the
only constraint on any $p$ is that the total number of 0's and 1's be $N$.) For $\mathcal{L}$ being
a square lattice with side $L$, two of the most important constraining rules are: (i) a single 0
may only occur at an end of a path; (ii) an isolated single 1 may only either occur at or be one
0-island away from an end of a path. Space does not allow us to give more than another relatively
simple example (with $L > 4$): for a path having the pattern $s = (1 \cdots 1)$ (both ends of the
path are 1-sites), $2n_{00} + n_{10} = 8L - 8$ and $2 \le n_{10} \le 4L - 12$. It is in fact
extremely difficult, if not impossible, to exhaust the complete set of such rules needed to reduce
$\mathcal{V}$ to $\mathcal{S}$. For our purpose it suffices to identify a large enough set of rules
that reduces $\mathcal{V}$ to a set $\mathcal{S}'$ sufficiently close to $\mathcal{S}$ for us to
understand the origin and characteristics of structures of high designability.
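
The link counts and the two rules quoted above are easy to state in code. The following Python sketch, with hypothetical helper names, counts $n_{00}$, $n_{10}$ and $n_{11}$ for a binary string, splits it into islands, and tests rules (i) and (ii) only; it does not implement the larger rule set that would be needed to build $\mathcal{S}'$.

```python
def link_counts(s):
    """Return (n00, n10, n11) for a binary string; n10 counts both 1-0 and 0-1 links."""
    n00 = n10 = n11 = 0
    for a, b in zip(s, s[1:]):
        if a != b:
            n10 += 1
        elif a == "0":
            n00 += 1
        else:
            n11 += 1
    return n00, n10, n11


def islands(s):
    """Split s into maximal runs of identical symbols; there are n10 + 1 of them."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] += ch
        else:
            runs.append(ch)
    return runs


def obeys_rules(s):
    """Check rules (i) and (ii) for a binary string s."""
    runs = islands(s)
    last = len(runs) - 1
    for k, run in enumerate(runs):
        if len(run) > 1:
            continue
        if run == "0" and k not in (0, last):
            return False  # rule (i): a single 0 only at an end of the path
        if run == "1" and k not in (0, 1, last - 1, last):
            return False  # rule (ii): a single 1 at, or one 0-island from, an end
    return True
```

For example, `link_counts("11001100")` gives `(2, 3, 2)`, and the string splits into $n_{10} + 1 = 4$ islands, all of length two, so it passes both rules.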

FIG. 1. (a) Number of peptides vs. $n_{10}$ for $\mathcal{V}$ (open circle), $\mathcal{V}_4$ (solid
triangle), and $\mathcal{V}_5$ (square), and number of structures vs. $n_{10}$ for $\mathcal{S}'$
(solid circle) and $\mathcal{S}$ (open triangle), on the $6 \times 6$ lattice. (b) Smallest Hamming
distance vs. difference in $n_{10}$ for all the 30408 paths in $\mathcal{S}$ on the $6 \times 6$
lattice. Average designabilities of the paths vs. $n_{10}$ for the (c) $4 \times 7$ and (d)
$6 \times 6$ lattices, respectively. Number of neighboring structures within a Hamming distance of
(e) $R_H = 5$ and (f) $R_H = 25$ of a structure of given designability.

In Fig. 1(a) the numbers of elements in $\mathcal{V}$ (open circle), $\mathcal{S}'$ (solid circle)
and $\mathcal{S}$ (open triangle) on a $6 \times 6$ lattice are respectively plotted against
$n_{10}$. The total number of elements under the curve for $\mathcal{V}$ gives the total number of
sites in the hypercube. $\mathcal{S}'$ is slightly greater than $\mathcal{S}$ but is much smaller
than $\mathcal{V}$. (The boundary of $\mathcal{S}'$ owes its roughness to the incompleteness of the
set of rules used to construct it.) It is seen that whereas for $\mathcal{V}$ the maximum possible
value for $n_{10}$ is $4L - 4 = 32$, for $\mathcal{S}$ and $\mathcal{S}'$ the corresponding maximum
is much less: $n_{\max} = 14$. As $n_{10}$ approaches $n_{\max}$ from below, the number of elements
in $\mathcal{S}'$ decreases rapidly whereas that in $\mathcal{V}$ increases toward a maximum. It
happens that in the hypercube the smallest Hamming distance between two structures is approximately
proportional to the difference in their respective $n_{10}$ numbers. This is evident in Fig. 1(b),
where the smallest Hamming distance is plotted against the difference in $n_{10}$ for all the pairs
among the 30408 binary structures on a $6 \times 6$ lattice, and is consistent with results given
in [17], in which $x(p)$ (the degree of clustering of hydrophobic residues) is analogous to
$n_{10}$.

Since allowed structures with $n_{10}$ values close to $n_{\max}$ live in a region of the hypercube
that is also most heavily populated by peptides, it follows that they would on average have a large
Voronoi polytope, and hence are most likely to have the highest designabilities. This is
substantially borne out by the results shown in Figs. 1(c) and (d), computed for the allowed
structures on the $4 \times 7$ and $6 \times 6$ lattices, respectively, where average designability
is plotted against $n_{10}$. The average designability does not peak exactly at
$n_{10} = n_{\max}$ but rather at values of $n_{10}$ just below $n_{\max}$. Why this should be so
is not yet clearly understood. Structures with maximum $n_{10}$ are the most constrained and are
very few in number, so that otherwise secondary details might have had a larger effect on their
designabilities. Preference for large $n_{10}$ has also been observed in other 2D and 3D lattices.
The relation between high designability and sparse population is further illustrated in Figs. 1(e)
and (f), where the number of structures within a Hamming distance $R_H$ of a given structure is
plotted against the designability of that structure. In (e), where $R_H = 5$, it is seen that
structures with high designability have far fewer near neighbors than structures with low
designability. In (f), where $R_H = 25$, it is seen that all structures have approximately the same
large number of near and far neighbors.
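
The neighbor count plotted in panels (e) and (f) amounts to counting, for each structure, the other structures inside a Hamming ball of radius $R_H$. A minimal Python sketch follows; the function names and the four-site example set are hypothetical and serve only to show the bookkeeping.

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbours_within(structures, radius):
    """For each structure, count the other structures at Hamming distance <= radius."""
    return {
        s: sum(1 for t in structures if t != s and hamming(s, t) <= radius)
        for s in structures
    }


# Hypothetical example set; real input would be the binary structures of a lattice.
example = ["1100", "0011", "0110", "1001"]
near = neighbours_within(example, 2)
far = neighbours_within(example, 4)
```

With a small radius the count probes the local density around a structure, whereas a radius comparable to the chain length includes essentially every structure, which is why the large-radius counts are nearly the same for all structures.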

Now something interesting emerges. A structure with its $n_{10}$ (almost) maximized but not allowed
to have single 1's or 0's except at its ends (rules (i) and (ii)) will have a preponderance of the
4-mer (1100) in the interior, so that large stretches of it will have the form
$(\cdots 11001100 \cdots)$, which suggests the linear structure of α-helices on a lattice. A
corollary is that structures with core-to-surface ratios close to unity are favored by
designability. This implies a diameter of approximately 10 residues for an ideal protein, which is
consistent with the typical size of 300 to 1000 amino acids in natural proteins. Note that the
selection of structures with maximized $n_{10}$ is a consequence of the geometric property of the
Hamiltonian (1) in hyperspace and does not depend on the specifics of a lattice. That larger
$n_{10}$'s are favored is a notion qualitatively consistent with the conclusion drawn from recent
studies on folding kinetics that optimal structures are also minimally frustrated [18].

To see if what we have observed so far has anything to do with real proteins we compare five
sequences, $\mathcal{V}_1$ to $\mathcal{V}_5$, each being a concatenation of a set of real protein
or ($6 \times 6$) lattice binary peptides: $\mathcal{V}_1$, (a) the representative non-redundant
2886 proteins (sequence similarity smaller than 90%) culled from the 9257 entries in PDB, or (b)
the even less redundant set of 1394 entries of protein domains from DDD, converted to binary
sequences based on hydrophobicity; $\mathcal{V}_2$, the sections in $\mathcal{V}_1$ that fold into
α-helices; $\mathcal{V}_3$, the sections in $\mathcal{V}_1$ that fold into β-sheets;
$\mathcal{V}_4$, the 27006 peptides in $\mathcal{V}$ mapped to the 15 structures of the highest
designabilities; $\mathcal{V}_5$, the 24134 peptides in $\mathcal{V}$ mapped to the 1545 structures
of the lowest designabilities. Interestingly, the H/P ratios of the five sequences are all very
close to 1; the percentage of hydrophobic residues contained in each is respectively 50.00%,
49.75%, 56.18%, 50.43% and 49.05% for $\mathcal{V}_1$ (from PDB) through $\mathcal{V}_5$.
Fig. 1(a) shows that neither distribution of peptides in $\mathcal{V}_4$ (solid triangle) and
$\mathcal{V}_5$ (open square) vs. $n_{10}$ is random, which corroborates previously reported
results. In particular, in $\mathcal{V}_5$ ($\mathcal{V}_4$) peptides with larger (smaller)
$n_{10}$'s are slightly favored over those with smaller (larger) $n_{10}$'s.
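
The conversion of an amino acid sequence to a binary sequence can be sketched as a simple lookup. The hydrophobic set used below is an assumption made purely for illustration; it is not the hydrophobicity scale used in this study, and a different choice would change the resulting H/P ratio.

```python
# Assumed, illustrative classification of one-letter residue codes as hydrophobic.
# This set is NOT taken from the paper.
HYDROPHOBIC = set("ACFILMVW")


def to_binary(sequence, hydrophobic=HYDROPHOBIC):
    """Map a one-letter amino acid sequence to a binary string (1 = H, 0 = P)."""
    return "".join("1" if aa in hydrophobic else "0" for aa in sequence.upper())


def hydrophobic_fraction(binary):
    """Fraction of 1's (hydrophobic residues) in a binary sequence."""
    return binary.count("1") / len(binary)


# Concatenating several converted sequences gives one long binary sequence.
concatenated = "".join(to_binary(seq) for seq in ["MKVLA", "GSTDE"])
```

Under this assumed classification `to_binary("MKVLA")` returns `"10111"`; the hydrophobic percentages quoted above correspond to `hydrophobic_fraction` evaluated on each concatenated sequence.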

FIG. 2. Overlap of frequency distribution functions of lattice peptides and PDB proteins (a) and
DDD protein domains (b) as a function of word length $l$: $O^{(l)}_{14}$ (open circle),
$O^{(l)}_{15}$ (solid circle), $O^{(l)}_{24}$ (open triangle), $O^{(l)}_{25}$ (solid triangle),
$O^{(l)}_{34}$ (open square), $O^{(l)}_{35}$ (solid square) and $O^{(l)}_{45}$ (open diamond).

Let $f_i^{(l)}(m)$ be the frequency, normalized to that of a sequence with an H/P ratio of unity
(if the word has $n_H$ H's and the actual frequency of the word is $f$, then the normalized
frequency is $(N_H/N_P)^{-n_H} f$, where $N_H/N_P$ is the H/P ratio of the sequence), of the $m$th
binary word of length $l$ occurring in sequence $\mathcal{V}_i$, and let
$F_i^{(l)}(m) = (f_i^{(l)}(m) - \bar{f}_i^{(l)})/Z$ be the normalized frequency distribution
function, where $\bar{f}_i^{(l)} = 2^{-l} \sum_m f_i^{(l)}(m)$ is the mean frequency and
$Z = \left(\sum_m (f_i^{(l)}(m) - \bar{f}_i^{(l)})^2\right)^{1/2}$ is the norm. The relations
$\sum_m F_i^{(l)}(m) = 0$ and $\sum_m (F_i^{(l)}(m))^2 = 1$ hold. The pairwise overlaps
$O_{ij}^{(l)} = \sum_{m=1}^{2^l} F_i^{(l)}(m) F_j^{(l)}(m)$; $i = 1, 2, 3$; $j = 4, 5$, which
measure correlations between $\mathcal{V}_i$ and $\mathcal{V}_j$ for $l = 4$ to 14, are given in
Figs. 2(a) and (b), where the real protein sequences used are from PDB and DDD, respectively. The
two sets of overlaps are qualitatively similar. It is seen that $\mathcal{V}_4$ ($\mathcal{V}_5$)
is positively (negatively) correlated with $\mathcal{V}_1$ and $\mathcal{V}_2$. For all values of
$l$ the strongest correlation occurs between the model sequence of high designability
($\mathcal{V}_4$) and the real protein sequence rich in α-helices ($\mathcal{V}_2$). The sequence
of high designability is poorly correlated with the sequence rich in β-sheets ($\mathcal{V}_3$)
and, as expected, the strongest anti-correlation occurs between the two model sequences of high
($\mathcal{V}_4$) and low ($\mathcal{V}_5$) designabilities.
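
The overlap measure can be sketched in a few lines of Python. The version below is a simplified illustration under stated assumptions: words are counted in overlapping windows, the H/P-ratio normalization of the raw frequencies is omitted, and the function names are hypothetical. It also assumes the word frequencies are not all equal, since the norm $Z$ would otherwise vanish.

```python
from itertools import product
from math import sqrt


def word_frequencies(sequence, l):
    """Relative frequency of every binary word of length l (overlapping windows)."""
    counts = {"".join(w): 0 for w in product("01", repeat=l)}
    for k in range(len(sequence) - l + 1):
        counts[sequence[k:k + l]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def normalized_distribution(sequence, l):
    """F(m) = (f(m) - mean) / Z, so that sum F = 0 and sum F^2 = 1."""
    f = word_frequencies(sequence, l)
    mean = sum(f.values()) / len(f)  # 2^{-l} * sum_m f(m)
    z = sqrt(sum((v - mean) ** 2 for v in f.values()))
    return {w: (v - mean) / z for w, v in f.items()}


def overlap(seq_i, seq_j, l):
    """Overlap O_ij = sum_m F_i(m) F_j(m); lies between -1 and 1."""
    F_i = normalized_distribution(seq_i, l)
    F_j = normalized_distribution(seq_j, l)
    return sum(F_i[w] * F_j[w] for w in F_i)


a = "1100110011001100"  # (1100) repeats
b = "1010101010101010"  # (10) repeats
```

By construction the overlap of a sequence with itself is 1, and for the two short test strings above the overlap at $l = 2$ is negative, reflecting their different word content.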

Even though the favoring of surface-core repeats by peptides folding into high-designability
structures is most likely not lattice specific (provided the H/P ratio is close to one), the
particular choice of the (1100) repeat is characteristic of a square lattice. For instance, on a
hexagonal lattice the predominant repeat would more likely be (10) rather than (1100). There is
some justification for selecting a square lattice over a hexagonal one because in real proteins the
backbone does not favor small-angle bends. On the other hand, real proteins do not live on
lattices, and the equivalent of (10) repeats does occur in real proteins where β-sheets are exposed
to solvent. Thus the low correlation between $\mathcal{V}_3$ and $\mathcal{V}_4$ is to some extent
an artifact of the square lattice, and it may be better to interpret the (1100) repeats on a square
lattice as representing α-type and some β-type repeats (but not the latter's foldings) in real
proteins.

Our study suggests that the rough formation of α-helices and some β-sheets and the collapse of
proteins into globular shapes are primarily determined by hydrophobicity. Since only the mean-field
part of the inter-residue interaction is included in the model, this implies that details of the
residual inter-residue interaction that determine the final shape of the native state are not
important at this stage. It has been pointed out that structures of high designability in a lattice
model with a two-letter amino acid alphabet may not be especially designable for higher-letter
alphabets. Although the situation may be different on a lattice larger than the $5 \times 5$
lattice used in that study, it remains to be verified whether our findings persist in finer-grained
and more realistic models. If they do, then we can better understand why the formation of α-helices
and the collapse would happen on a similar time scale, of the order of $10^{-7}$ s, why the
formation of β-sheets would take somewhat longer (about $10^{-6}$ s) [10, 13], and why these time
scales would be so much shorter than the time needed to complete the rest of the folding
($10^{-1}$ s to 10 s). This scenario is in any case consistent with the finding in a recent
statistical analysis of experimental data: local contacts play the key role in fast processes
during folding [22].